Univariate plot section

Loan Volume by months

I first want to check whether loan volumes vary by month. Since this data is US based, I want to see if volumes increase or decrease during certain months, e.g. the holiday season around Thanksgiving. Since people go on a shopping spree during these months, do defaults on loans also increase during this period?

An increase in loan volume during the holiday season (Oct, Nov, Dec, Jan) can be noticed.

When are most loans defaulted?

This distribution is similar to loan volumes by month, so defaults seem to increase during the holiday season too.

Loan Volumes over the years

I want to check if loan volumes have changed over the years. Prosper started in 2005 and the GFC occurred in 2008, so did the volumes change over that period?

It appears loan volumes dropped off in 2009-2010, then started picking up again and peaked in 2013. Note that we do not have all the data for 2014.

It will be interesting to understand why volumes fell in 2009.

Where are the most loans taken?

Which state uses Prosper the most and takes out the most loans?

From the graph, CA takes out the most loans.

How long are the terms of the loan?

Do people take short-term loans or long-term loans?

Most loans have a term of 36 months.

How much monthly loan repayment?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

The distribution of loan payments is positively skewed, with the most common repayment at about $200.

What interest rate are people borrowing at?

The distribution of BorrowerAPR is slightly positively skewed, with a spike at about 0.35 (i.e. 35%).

What is the Prosper score of the loans?

The Prosper score is a custom risk score built using historical Prosper data, applicable to loans originated after July 2009. From the graph, most loans seem to have a Prosper score between 4 and 8. I guess relatively high Prosper scores should predict a good loan outcome. It would be interesting to see if low Prosper scores give a higher lender yield and vice versa, and also whether high Prosper scores predict a good loan outcome.

What is the most common reason for the loan?

It would be interesting to understand why people use P2P lending as opposed to banks. For lenders the reason is obvious: the yields are higher, though the risks and the due diligence work should be higher too. For borrowers, the reasons could include turnaround time, low credit scores and lower repayments.

Most loans are for debt consolidation, as seen in the graph.

Debt to income ratio of the borrowers

Most people take on reasonable debt, but there are a few borrowers who take on very large debt.

Income range of the borrowers

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The largest group of borrowers has an income range of $25,000-49,999, closely followed by $50,000-74,999.
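The table above is sorted alphabetically, which puts $100,000+ before $25,000-49,999 and makes the ranking harder to read. A minimal sketch of how the categories could be put into a natural order is shown here. The counts are copied from the output above; the vector name `income_order` and the choice to place the two non-numeric categories first are my own assumptions.

```r
# Counts copied from the table() output above (alphabetical order).
income_counts <- c(
  "$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
  "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
  "$75,000-99,999" = 16916, "Not displayed" = 7741, "Not employed" = 806
)

# Hypothetical natural ordering: non-numeric categories first, then by income.
income_order <- c(
  "Not displayed", "Not employed", "$0", "$1-24,999",
  "$25,000-49,999", "$50,000-74,999", "$75,000-99,999", "$100,000+"
)

# Reorder the counts and find the largest group.
ordered_counts <- income_counts[income_order]
largest <- names(which.max(ordered_counts))
print(ordered_counts)
print(largest)

# On the full data the same ordering could be applied before plotting, e.g.
# loan_data_csv$IncomeRange <- factor(loan_data_csv$IncomeRange,
#                                     levels = income_order)
```

With the levels ordered this way, a bar chart of IncomeRange would read from low to high income instead of alphabetically.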

What is the lender yield like?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

The lender yield distribution is positively skewed, with a large percentage of loans yielding close to 0.3 (i.e. 30%).

Univariate Analysis

What is the structure of your dataset?

The data has 113937 observations of 81 variables.

What is/are the main feature(s) of interest in your dataset?

This data set has 81 variables, so I chose a subset of the data as features to study. The features chosen are:

BorrowerState, LoanOriginalAmount, BorrowerAPR, ProsperScore, LenderYield, CreditScoreRangeLower

Did you create any new variables from existing variables in the dataset?

I created 6 variables: Loan_year, Loan_month, Loan_closed_Date_month, Loan_closed_Date_year, Credit_Type, TotalMonthlyDebt
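As an illustration, the four date-based variables could be derived roughly as follows. This is a sketch on two toy rows rather than the code used for this report. It assumes LoanOriginationDate and ClosedDate are stored as text starting with a `YYYY-MM-DD` date, which may not match the raw file exactly.

```r
# Toy rows standing in for loan_data_csv; the timestamp format is an assumption.
loans <- data.frame(
  LoanOriginationDate = c("2007-08-26 00:00:00", "2014-03-03 00:00:00"),
  ClosedDate          = c("2009-08-14 00:00:00", NA),
  stringsAsFactors    = FALSE
)

# Parse the leading date part; loans that are still open keep NA.
orig   <- as.Date(loans$LoanOriginationDate, format = "%Y-%m-%d")
closed <- as.Date(loans$ClosedDate, format = "%Y-%m-%d")

# Year as integer, month as a factor ordered Jan..Dec so plots sort correctly.
loans$Loan_year  <- as.integer(format(orig, "%Y"))
loans$Loan_month <- factor(month.abb[as.integer(format(orig, "%m"))],
                           levels = month.abb)
loans$Loan_closed_Date_year  <- as.integer(format(closed, "%Y"))
loans$Loan_closed_Date_month <- factor(month.abb[as.integer(format(closed, "%m"))],
                                       levels = month.abb)

print(loans[, c("Loan_year", "Loan_month",
                "Loan_closed_Date_year", "Loan_closed_Date_month")])
```

Using `month.abb` as the factor levels keeps the months in calendar order regardless of locale.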

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features I have used to further support my investigation are:

LoanOriginationDate, ClosedDate, Term, BorrowerAPR, ProsperScore, ListingCategory, Occupation, EmploymentStatus, EmploymentStatusDuration, CurrentCreditLines, DebtToIncomeRatio, IncomeRange, StatedMonthlyIncome, Recommendations

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The plot of loan volumes over the years shows that loans fell abruptly in 2009, which was unusual since volume had been growing year on year from 2005.

Bivariate Plots Section

Have lending criteria become stricter?

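library(ggplot2)

# Loan_year is discrete, so count loans per year with geom_bar()
# (geom_histogram() requires a continuous x variable).
ggplot(aes(x=factor(Loan_year),fill=Credit_Type),data=loan_data_csv)+
geom_bar()+
xlab("Loan_year")+
ggtitle("Borrower profile over the years")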

From the graph we can see that credit requirements have become stricter: from 2009 onwards, loans are given only to borrowers with at least fair credit.
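For reference, a categorical variable like Credit_Type can be built from CreditScoreRangeLower with `cut()`. The sketch below is only an illustration: the break points (580, 670, 740) and the labels are hypothetical, FICO-style bands that I am assuming here, not necessarily the ones behind the plot above.

```r
# Toy credit scores; NA stands for a loan with no recorded score.
scores <- c(480, 600, 680, 760, NA)

# Hypothetical bands: [-Inf, 580) Poor, [580, 670) Fair,
# [670, 740) Good, [740, Inf) Excellent.
credit_type <- cut(
  scores,
  breaks = c(-Inf, 580, 670, 740, Inf),
  labels = c("Poor", "Fair", "Good", "Excellent"),
  right  = FALSE   # intervals are closed on the left
)

print(data.frame(scores, credit_type))
```

With `right = FALSE` a score exactly on a boundary (for example 580) falls into the higher band, and missing scores stay NA rather than being assigned a band.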

Loan delinquency over time

The plot shows that defaults were very high in 2006 and then fell steadily from 2009 to 2013. The next question is why there was such a dramatic change, i.e. why have defaults fallen so much? Have lending standards improved?
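Because loan volume itself changes a lot from year to year, raw default counts can be misleading, so it helps to look at the share of loans that defaulted in each year. Here is a minimal sketch on toy data. It assumes a LoanStatus column in which "Defaulted" and "Chargedoff" mark bad loans; both the column values and the toy rows are assumptions for illustration only.

```r
# Toy data: 4 loans originated in 2006 and 5 in 2013.
loans <- data.frame(
  Loan_year  = c(2006, 2006, 2006, 2006, 2013, 2013, 2013, 2013, 2013),
  LoanStatus = c("Defaulted", "Chargedoff", "Completed", "Defaulted",
                 "Current", "Completed", "Current", "Defaulted", "Current"),
  stringsAsFactors = FALSE
)

# Flag loans that ended badly (assumed definition of a default).
loans$is_default <- loans$LoanStatus %in% c("Defaulted", "Chargedoff")

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the default rate per origination year.
default_rate <- tapply(loans$is_default, loans$Loan_year, mean)
print(default_rate)
```

The same normalisation would apply to the monthly view: dividing defaults by the number of loans in each month separates a genuine seasonal effect from a simple volume effect.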

Does a loan's Prosper score affect its interest rate?

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$ProsperScore and loan_data_csv$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6719940 -0.6645469
## sample estimates:
##        cor 
## -0.6682872

ProsperScore and BorrowerAPR are strongly negatively correlated (r ≈ -0.67).

The Prosper score vs BorrowerAPR graph and the BorrowerAPR spread both indicate that interest rates fall for loans with a high Prosper score, and vice versa.
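One detail worth noting in the output above is df = 84851. For a Pearson test the degrees of freedom are n - 2, so only 84853 loans have both a ProsperScore and a BorrowerAPR, which is consistent with the score existing only for loans after July 2009. The sketch below shows on simulated data that `cor.test()` silently drops incomplete pairs. The numbers are made up and only the mechanics matter.

```r
set.seed(1)

# Simulated scores and rates: a higher score gives a lower rate, plus noise.
score <- sample(1:11, 200, replace = TRUE)
apr   <- 0.35 - 0.02 * score + rnorm(200, sd = 0.03)

# Pretend the first 20 loans pre-date the score, as with pre-2009 loans.
score[1:20] <- NA

# cor.test() keeps only the pairs where both values are present.
res <- cor.test(score, apr)

n_complete <- sum(complete.cases(score, apr))
print(n_complete)       # number of usable pairs
print(res$parameter)    # df = n_complete - 2
print(res$estimate)     # sample correlation (negative here by construction)
```

This is why the tests involving ProsperScore report a smaller df than those involving CreditScoreRangeLower or StatedMonthlyIncome.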

LoanOriginalAmount vs ProsperScore

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$LoanOriginalAmount and loan_data_csv$ProsperScore
## t = 80.475, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2600308 0.2725335
## sample estimates:
##       cor 
## 0.2662933

Loan original amount and Prosper score have a weak-to-moderate positive correlation (r ≈ 0.27). This indicates that as the Prosper score increases, loans could potentially have a larger amount.

The plots indicate that loans with low Prosper scores have low amounts, and those with higher Prosper scores can have higher loan amounts.

Does credit score influence borrower APR?

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$CreditScoreRangeLower and loan_data_csv$BorrowerAPR
## t = -160.21, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4344422 -0.4249487
## sample estimates:
##        cor 
## -0.4297073

Credit score and borrower interest rate are moderately negatively correlated (r ≈ -0.43), indicating that borrowers with a low credit score probably pay more interest and borrowers with a good credit score pay less.

The general trend is that as credit score increases, borrower APR decreases, as evidenced by the graph.

Income vs Loan Amount

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$StatedMonthlyIncome and loan_data_csv$LoanOriginalAmount
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1956816 0.2068243
## sample estimates:
##       cor 
## 0.2012595

Monthly income and loan amount are weakly positively correlated (r ≈ 0.20). This suggests that people on higher incomes can potentially take on bigger loans.

The plots show that people on larger incomes can take on larger loans.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$Term and loan_data_csv$LoanOriginalAmount
## t = 121.6, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3337778 0.3440569
## sample estimates:
##       cor 
## 0.3389275

LoanOriginalAmount and loan term are positively correlated (r ≈ 0.34), which implies that larger loans are taken over a longer period.

The above plot shows that longer-term loans are usually larger loans.

More risk more reward? Lender yield vs Creditscore

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$CreditScoreRangeLower and loan_data_csv$LenderYield
## t = -171.71, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4589577 -0.4497179
## sample estimates:
##      cor 
## -0.45435

Credit score and lender yield are negatively correlated (r ≈ -0.45), implying that as credit score increases lender yield decreases, and vice versa.

As credit score increases, lender yield trends down, as evidenced by the plot.

Lender yield and prosper score

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$ProsperScore and loan_data_csv$LenderYield
## t = -249.01, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6536541 -0.6458788
## sample estimates:
##        cor 
## -0.6497835

Lender yield and Prosper score are strongly negatively correlated (r ≈ -0.65), implying that as Prosper score increases lender yield decreases, and vice versa.

The above plot reaffirms that the yield falls for loans with high Prosper scores.

Lender yield vs borrower apr

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$LenderYield and loan_data_csv$BorrowerAPR
## t = 2291.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9892049 0.9894515
## sample estimates:
##       cor 
## 0.9893289

Lender yield and BorrowerAPR are very highly correlated (r ≈ 0.99). The relationship is pretty linear.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is a strong negative relationship between ProsperScore and BorrowerAPR, meaning loans with higher scores have lower interest rates.

There is a very strong positive correlation between lender yield and BorrowerAPR. This implies that loans with a higher lender yield carry higher interest rates.

There is a strong negative correlation between ProsperScore and lender yield. This implies that loans with good Prosper scores have a lower yield, and vice versa.

There is a moderate negative correlation between credit score and BorrowerAPR, i.e. people with higher credit scores get cheaper loans.

Loan amount and loan term are positively correlated, meaning larger loans are taken over a longer period of time.

There is a negative correlation between credit scores and lender yield, implying more risk, more reward.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found that Prosper's lending criteria have become more stringent. Loans are given only to people with reasonably good credit scores. I also noticed that default rates have fallen considerably over the years.

What was the strongest relationship you found?

The strongest relationship is between BorrowerAPR and LenderYield. When LenderYield is high, BorrowerAPR is high, and vice versa. This is plausible since LenderYield is defined as the interest rate less the service fee, i.e. BorrowerAPR in some part determines LenderYield.

Multivariate plots section

Prosper business: how much have volumes grown in each state over the years?

Some states, such as IA, ME and ND, have an abrupt distribution. After further research I found that these states have disallowed Prosper. CA seems to have the most loans, followed by NY, TX, GA and FL. RI, NV and SD do not have data for 2005-2008; Prosper was introduced there after 2008.

Lenderyield versus DebtToIncomeRatio

The money is in risky investments: as evidenced by the graph, yield is high where DebtToIncomeRatio > 1. The Prosper score of these loans is low.

BorrowerAPR versus LoanAmount

## 
##  Pearson's product-moment correlation
## 
## data:  loan_data_csv$BorrowerAPR and loan_data_csv$LoanOriginalAmount
## t = -115.14, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3280787 -0.3176752
## sample estimates:
##        cor 
## -0.3228867

Borrowers with good credit scores are able to borrow larger amounts at a low interest rate. Overall, the relationship between loan amount and BorrowerAPR is not particularly strong: the correlation is only moderately negative (r ≈ -0.32).

Loan original amount vs Income

From the above plots it can be concluded that people who are employed and on a relatively high wage with a good credit score take on higher debts.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

lr<-lm(LenderYield~BorrowerAPR+ProsperScore+
         CreditScoreRangeLower+
         DebtToIncomeRatio,data=loan_data_csv)

summary(lr)
## 
## Call:
## lm(formula = LenderYield ~ BorrowerAPR + ProsperScore + CreditScoreRangeLower + 
##     DebtToIncomeRatio, data = loan_data_csv)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.070892 -0.004182 -0.000523  0.005379  0.021991 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)           -5.666e-02  6.078e-04  -93.233   <2e-16 ***
## BorrowerAPR            9.531e-01  5.595e-04 1703.592   <2e-16 ***
## ProsperScore           7.480e-04  1.704e-05   43.898   <2e-16 ***
## CreditScoreRangeLower  3.202e-05  7.520e-07   42.582   <2e-16 ***
## DebtToIncomeRatio     -3.079e-04  9.432e-05   -3.264   0.0011 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.008245 on 77552 degrees of freedom
##   (36380 observations deleted due to missingness)
## Multiple R-squared:  0.9876, Adjusted R-squared:  0.9876 
## F-statistic: 1.538e+06 on 4 and 77552 DF,  p-value: < 2.2e-16
#R^2 is 0.9876. The linear model is very good at predicting LenderYield,
#as evidenced by the R^2. The variables are significant, hence I have included
#them all. Some of the independent variables are highly correlated with each
#other, so there could be a multicollinearity problem.

library(car)
vif(lr)
##           BorrowerAPR          ProsperScore CreditScoreRangeLower 
##              2.237766              1.848829              1.435037 
##     DebtToIncomeRatio 
##              1.028572
#The VIFs are not too large, so the model does not exhibit multicollinearity

A linear model is built predicting lender yield from BorrowerAPR, ProsperScore, CreditScoreRangeLower and DebtToIncomeRatio. The linear model has an R^2 of 0.9876, which is pretty good. All independent variables are significant.
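To make the VIF check less of a black box: under the standard definition, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated predictors. The helper `vif_manual` and the variables `x1`, `x2`, `x3` are hypothetical names for illustration, not part of the model above. For a model like this one, where every term is a single numeric variable, the result should agree with `vif()` from the car package.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
preds <- data.frame(x1, x2, x3)

# Hypothetical helper: regress one predictor on the rest, return 1 / (1 - R^2).
vif_manual <- function(predictor, data) {
  others <- setdiff(names(data), predictor)
  fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(preds), vif_manual, data = preds)
print(round(vifs, 3))

# Reading the report's own output the other way round: a VIF of 2.237766
# for BorrowerAPR means the other predictors explain about 55% of it.
r2_apr <- 1 - 1 / 2.237766
print(round(r2_apr, 4))
```

So a VIF near 1 means a predictor is almost unrelated to the others, while the value of about 2.2 for BorrowerAPR reflects its correlation with ProsperScore and credit score without being large enough to worry about.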

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I looked at loan amounts versus monthly income using the features IncomeRange, EmploymentStatus and credit score to understand the relationship further. After looking at the plots I could conclude that people who are employed with a good salary and reasonable credit scores take on larger loans. I also found that lender yield increases in risky investments, and that Prosper's lending criteria have become stricter than they were a few years back.

Were there any interesting or surprising interactions between features?

I observed that loan volumes fell off in 2009. After searching online I found that the SEC had put a cease and desist order on Prosper in Nov 2008. It also appears from the plots that Prosper has made its lending criteria more stringent since it started; it seems to give loans only to people with a good credit history.

Final Plots and Summary

Prosper loan volumes by state

Loan volumes have increased drastically since 2005, with a dip in 2009. The plot also shows that Prosper did not launch in all states simultaneously. In some states it started later, like Rhode Island, Nevada and South Dakota, and in some states it is still not available, like Maine, Iowa and North Dakota.

Tightening of Criteria for granting loans

The borrower profile in terms of CreditScoreRangeLower has changed from 2006 to 2014. In 2006 there were some loans to borrowers with low scores (<500), while in 2014 there are no such loans and all scores are at least 600. This could account for more defaults early on.

Lender yield vs Borrower APR

The above plot shows that lender yield increases as the BorrowerAPR of the loan increases. Notice that the relationship is linear, as evidenced by the red line. A higher lender yield also corresponds to riskier loans, as evidenced by the color of the points.

Reflection

This data set is pretty large, with many different variables. My first difficulty was understanding how the peer-to-peer lending business works; then I tried to understand what the various variables in the data set mean. Initially I chose far too many features, then slowly I brought that down to a few main ones. Using EDA I then tried to explore their relationships. I wanted to understand why a lender or borrower would opt for P2P lending rather than go to a bank. It would be useful to compare the investor return from P2P lending with that from a brick-and-mortar bank, bonds or shares, and similarly, for a borrower, the interest rate from P2P lending with that from a standard bank.

I then tried to understand what drives the lender yield. The data suggests, though not conclusively, that as with equities, risky behaviour is rewarded. I then tried to model what determines lender yield. My model has very few variables and a good R^2. I noticed that the independent variables are quite well correlated, so I checked for multicollinearity by calculating VIFs (Variance Inflation Factors). These turned out to be reasonable, so I included all the independent variables in the model. It would also be very interesting to model the Prosper score: on what basis does Prosper allocate a Prosper score to its loans? I did try modelling this using a linear model, but my R^2 of 0.63 was not very good.

I can conclude that a platform like Prosper gives good returns to investors. The number of defaults has fallen over the years, since loan applications are screened more strictly and only creditworthy borrowers are given loans. Since the P2P lending space has become more competitive, it would be interesting to see whether the returns that investors are currently getting will continue.